MP968 Experimental Design Workshop

Leighton Pritchard

University of Strathclyde

2025-11-24

Why do we need experimental design?

We should not cause unnecessary suffering

We should always minimise suffering

This may mean not performing an experiment at all. Not all new knowledge or understanding is worth causing suffering to obtain it.

Where there is sufficient justification to perform an experiment, we are ethically obliged to minimise the amount of distress or suffering that is caused, by designing the experiment to achieve this.

Why we need statistics

It may be easy to tell whether an animal is well-treated, or whether an experiment is necessary.

But what is an acceptable (i.e. the least possible) amount of suffering necessary to obtain an informative result?

Challenge

Quiz question

Suppose you are running a necessary and useful experiment with animal subjects, where the use of animals is morally justified. You are comparing a treatment group to a control group. Which of the following choices will cause the least amount of suffering?

  • Use three subjects per group so a standard deviation can be calculated
  • Use just enough subjects to establish that the outcome is likely to be correct
  • Use just enough subjects to be certain that the outcome is correct
  • Use as many subjects as you have available, to avoid wastage

How many individuals?

The appropriate number of subjects

The appropriate number of animal subjects to use in an experiment is always the smallest number that - given reasonable assumptions - will satisfactorily give the correct result to the desired level of certainty.

  • What assumptions are reasonable?
  • What is an appropriate level of certainty?

By convention, the usual level of certainty for a hypothesis test is: “we have an 80% chance of getting the correct true/false answer for the hypothesis being tested”

Design experiments to minimise suffering

Experimental design and statistics are intertwined

Once a research hypothesis has been devised:

  • Experimental design is the process of devising a practical way of answering the question
  • Statistics informs the choices of variables, controls, numbers of individuals and groups, and the appropriate analysis of results

Design your experiment for…

  • your population or subject group (e.g. sex, age, prior history, etc.)
  • your intervention (e.g. drug treatment)
  • your contrast or comparison between groups (e.g. lung capacity, drug concentration, etc.)
  • your outcome (i.e. is there a measurable or clinically relevant effect)

The 2009 NC3Rs systematic survey

The importance of experimental design

“For scientific, ethical and economic reasons, experiments involving animals should be appropriately designed, correctly analysed and transparently reported. This increases the scientific validity of the results, and maximises the knowledge gained from each experiment. A minimum amount of relevant information must be included in scientific publications to ensure that the methods and results of a study can be reviewed, analysed and repeated. Omitting essential information can raise scientific and ethical concerns.” (Kilkenny et al. (2009))

We rely on the reporting of the experiment to know if it was appropriate

Causes for concern 1

“Detailed information was collected from 271 publications, about the objective or hypothesis of the study, the number, sex, age and/or weight of animals used, and experimental and statistical methods. Only 59% of the studies stated the hypothesis or objective of the study and the number and characteristics of the animals used. […] Most of the papers surveyed did not use randomisation (87%) or blinding (86%), to reduce bias in animal selection and outcome assessment. Only 70% of the publications that used statistical methods described their methods and presented the results with a measure of error or variability.” (Kilkenny et al. (2009))

We cannot rely on the literature for good examples of experimental design

Causes for concern 2

No publication explained their choice for the number of animals used

We cannot rely on the verbal authority of ‘published scientists’ or ‘experienced scientists’ for good experimental design

Very strong cause for concern

“Power analysis or other very simple calculations, which are widely used in human clinical trials and are often expected by regulatory authorities in some animal studies, can help to determine an appropriate number of animals to use in an experiment in order to detect a biologically important effect if there is one. This is a scientifically robust and efficient way of determining animal numbers and may ultimately help to prevent animals being used unnecessarily. Many of the studies that did report the number of animals used reported the numbers inconsistently between the methods and results sections. The reason for this is unclear, but this does pose a significant problem when analysing, interpreting and repeating the results.” (Kilkenny et al. (2009))

Important

As scientists, you - yourselves - need to understand the principles behind the statistical tests you use, in order to choose appropriate tests and methods, and to use appropriate measures to minimise animal suffering and obtain meaningful results.

You cannot simply rely on the word of “experienced scientists” for this.

The ARRIVE guidelines

The following year Kilkenny et al. (2010) proposed the ARRIVE guidelines: a checklist to help researchers report their animal research transparently and reproducibly.

  • Good reporting is essential for peer review and to inform future research
  • Reporting guidelines measurably improve reporting quality
  • Improved reporting maximises the output of published research

ARRIVE guidelines highlights

Many journals now routinely request information in the ARRIVE framework, often as electronic supplementary information. The framework covers 20 items including the following (Kilkenny et al. (2010)):

ARRIVE guidelines (highlights)

    1. Objectives: primary and any secondary objectives of the study, or specific hypotheses being tested
    2. Study design: brief details of the study design, including the number of experimental and control groups, any steps taken to minimise the effects of subjective bias, and the experimental unit
    3. Sample size: the total number of animals used in each experiment and the number of animals in each experimental group; how the number of animals was decided
    4. Statistical methods: details of the statistical methods used for each analysis; methods used to assess whether the data met the assumptions of the statistical approach
    5. Outcomes and estimation: results for each analysis carried out, with a measure of precision (e.g., standard error or confidence interval).

A vital step

Warning

“A key step in tackling these issues is to ensure that the next generation of scientists are aware of what makes for good practice in experimental design and animal research, and that they are not led into poor or inappropriate practices by more senior scientists without a proper grasp of these issues.”

Recommended reading

Bate and Clark (2014)

Some Statistical Concepts

Random variables

Your experimental measurements are random variables

Important

This does not mean that your measurements are entirely random numbers

Caution

Random variables are values whose range is subject to some element of chance, e.g. variation between individuals

  • Tail length (e.g. timing of developmental signals, distribution of nutrients)
  • Blood concentrations (e.g. circulatory heterogeneity, transient measurement differences)
  • Survival time (e.g. determining point of death)

Probability distributions

The probability distribution of a random variable \(z\) (e.g. what you measure in an experiment) describes the range of values that \(z\) can take, and how likely each value is

The mean of the distribution of \(z\)

  • The mean (aka expected value or expectation) is the average of all the values in \(z\)
    • Equivalently: the mean is the value that is obtained on average from a random sample from the distribution
  • Written as \(\mu_{z}\) or \(E(z)\)

The variance of a distribution of \(z\)

  • The variance of the distribution of \(z\) represents the expected mean squared difference from the mean \(\mu_z\) (or \(E(z)\)) of a random sample from the distribution.
    • \(\textrm{variance} = E((z - \mu_z)^2)\)

Understanding variance

A distribution where all values of \(z\) are the same

  • Every single value in the distribution (\(z\)) is also the mean value (\(\mu_z\)), therefore

\[z = \mu_z \implies z - \mu_z = 0 \implies (z - \mu_z)^2 = 0\] \[\textrm{variance} = E((z - \mu_z)^2) = E(0^2) = 0\]

All other distributions

In every other distribution, some values of \(z\) differ from the mean, so for at least some values of \(z\)

\[z \neq \mu_z \implies z - \mu_z \neq 0 \implies (z - \mu_z)^2 \gt 0 \] \[\implies \textrm{variance} = E((z - \mu_z)^2) \gt 0 \]

Standard deviation

Standard deviation is the square root of the variance

\[\textrm{standard deviation} = \sigma_z = \sqrt{\textrm{variance}} = \sqrt{E((z - \mu_z)^2)} \]

Advantages

  • The standard deviation (unlike variance) takes values on the same scale as the original distribution
    • Standard deviation is a more “natural-seeming” interpretation of variation

Note

We can calculate mean, variance, and standard deviation for any probability distribution.
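The definitions above can be checked numerically. A minimal sketch in Python with NumPy; the choice of an exponential distribution (scale 2, so true mean and sd are both 2) is an arbitrary illustration, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(42)

# Large sample from an arbitrary non-normal distribution
# (exponential with scale 2, so true mean = 2, true sd = 2)
z = rng.exponential(scale=2.0, size=100_000)

mean_z = z.mean()                        # estimate of E(z)
variance_z = ((z - mean_z) ** 2).mean()  # E((z - mu_z)^2)
sd_z = np.sqrt(variance_z)               # square root of the variance
```

With this many samples, `mean_z` and `sd_z` both land close to the true value of 2, and `variance_z` close to 4, even though the distribution is not normal.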

Normal Distribution 1

\[ z \sim \textrm{normal}(\mu_z, \sigma_z) \]

Note

We only need to know the mean and standard deviation to define a unique normal distribution

Tip

Measurements of variables whose value is the sum of many small, independent, additive factors may follow a normal distribution

Important

There is no reason to expect that a random variable representing direct measurements in the world will be normally distributed!

Normal Distribution 2

Tip

  • For a normal distribution, the mean value is the value at the peak of the curve
  • The curve is symmetrical, so standard deviation describes variability equally well on both sides of the mean

(Non-)Normal Distribution 3

Tip

  • Here, the mean may not be the same value as the peak of the curve (i.e. the mode)
  • The curve is asymmetrical, so standard deviation does not describe variation equally well on either side of the mean

Binomial Distribution 1

Suppose you’re taking shots in basketball

  • how many shots?
  • how likely are you to score?
  • what is the distribution of the number of successful shots?

Tip

This kind of process generates a random variable approximating a probability distribution called a binomial distribution.

It is different from a normal distribution.

Binomial Distribution 2

\[ z \sim \textrm{binomial}(n, p) \]

Tip

  • number of shots, \(n = 20\), probability of scoring, \(p = 0.3\)

\[z \sim \textrm{binomial}(20, 0.3) \]

mean and sd

\[ \textrm{mean} = n \times p \] \[ \textrm{sd} = \sqrt{n \times p \times (1-p)}\]

Design note

You need to design your experiments and analyses to reflect the appropriate process/probability distributions of your data. E.g., does \(p\) differ between two conditions?
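The basketball example can be simulated to confirm the mean and sd formulas. A sketch; the seed and number of simulated sessions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 20, 0.3  # 20 shots per session, 30% chance of scoring each shot

# Simulate many independent 20-shot sessions
scores = rng.binomial(n, p, size=200_000)

empirical_mean = scores.mean()  # should approach n * p = 6.0
empirical_sd = scores.std()     # should approach sqrt(n * p * (1 - p)) ~ 2.05
```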

Poisson distribution 1

In prior experiments the frequency of calcium events in WKY was 3.8 \(\pm\) 1.1 events/field/min compared to 18.9 \(\pm\) 7.1 in SHR

This is not normal (or binomial)

Something that happens a certain number of times in a fixed interval generates a Poisson distribution.

This is different from a normal or binomial distribution.

Poisson distribution 2

\[z \sim \textrm{poisson}(\lambda)\]

Poisson distribution

\[ \textrm{mean} = \lambda \] \[ \textrm{sd} = \sqrt{\lambda} \]

Expectation (\(\lambda\))

  • Only one parameter is provided, \(\lambda\): the rate with which the measured event happens

  • Suppose a county has a population of 100,000, and the average cancer rate is 45.2 cases per million people each year

\[z \sim \textrm{poisson}(100{,}000 \times 45.2 / 1{,}000{,}000) = \textrm{poisson}(4.52) \]

Design note

You need to design your experiments and analyses to reflect the appropriate process/probability distributions of your data

  • E.g., does \(\lambda\) differ between two conditions?
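The county example can be simulated the same way, confirming that mean \(= \lambda\) and sd \(= \sqrt{\lambda}\). A sketch with arbitrary seed and simulation count:

```python
import numpy as np

rng = np.random.default_rng(1)

lam = 4.52  # expected events per interval (the county cancer example)

# Simulate many years of counts for the county
counts = rng.poisson(lam, size=200_000)

empirical_mean = counts.mean()  # should approach lambda = 4.52
empirical_sd = counts.std()     # should approach sqrt(lambda) ~ 2.13
```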

Binomial and Poisson distributions

Some important features

  • All measured values (and \(n\)) are whole numbers greater than or equal to zero; \(\lambda\) may be any non-negative real number, and \(p\) lies between 0 and 1
  • The distributions may not be unimodal
  • The mean is not always the peak value (mode)
  • The distributions are not always symmetrical (so sd may not describe variation equally either side of the mean)

Distributions in Practice

Distributions are starting points

  • Distributions arise from and represent distinct generation processes (relate this to your biological system)
    • Normal distributions are generated by sums, differences, and averages
    • Poisson distributions are generated by counts (per unit interval)
    • Binomial distributions are generated by success/failure outcomes
  • Design experiments with analyses that reflect these processes

Warning

  • All statistical distributions are idealisations that ignore many features of real data
  • No real world data should be expected to exactly match any statistical distribution
  • Poisson models tend to need adjustment for overdispersion

Normal Distribution Redux

Probability mass

  • approximately 50% of the distribution lies in the range \(\mu \pm 0.68\sigma\)
  • approximately 68% of the distribution lies in the range \(\mu \pm \sigma\)
  • approximately 95% of the distribution lies in the range \(\mu \pm 2\sigma\)
  • approximately 99.7% of the distribution lies in the range \(\mu \pm 3\sigma\)
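These probability masses follow directly from the normal cumulative distribution function. A stdlib-only check (the helper `central_mass` is introduced here for illustration):

```python
from math import erf, sqrt

def central_mass(k):
    """Probability that a normal random variable lies within k
    standard deviations of its mean: Phi(k) - Phi(-k)."""
    phi = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))
    return phi(k) - phi(-k)

# central_mass(0.68) ~ 0.50, central_mass(1.0) ~ 0.683,
# central_mass(2.0) ~ 0.954, central_mass(3.0) ~ 0.997
```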

Estimates, standard errors, and confidence intervals

Parameters

Parameters are unknown numbers that determine a statistical model

A linear regression

\[ y_i = a + b x_i \]

  • Parameters are:
    • \(a\) (the intercept)
    • \(b\) (the gradient)

A normal distribution representing your data

\[ z \sim \textrm{normal}(\mu_z, \sigma) \]

  • Parameters are: \(\mu_z\) and \(\sigma\)

Estimands

An estimand (or quantity of interest) is a value that we are interested in estimating

A linear regression

\[ y_i = a + b x_i\]

  • We want to estimate values for:
    • \(a\) (the intercept)
    • \(b\) (the gradient)
    • predicted outcomes at important values of \(x_i\)

These are all estimands, and estimates are represented using the “hat” symbol: \(\hat{a}\), \(\hat{b}\), etc.

A normal distribution representing your data

\[ z \sim \textrm{normal}(\mu_z, \sigma) \]

  • Estimands are: \(\mu_z\) and \(\sigma\)
    • Maybe you want to determine the 95% confidence interval - this is also an estimand

Standard Errors and Confidence Intervals

  • The standard error is the estimated standard deviation of an estimate
    • It is a measure of our uncertainty about the quantity of interest

Note

  • Standard error gets smaller as sample size gets larger
    • You know more about the most likely value, the more data/information you collect
    • Standard error tends to zero as sample size gets large enough
  • The confidence interval (or CI) represents a range of values of a parameter or estimand that are roughly consistent with the data

Important

  • In repeated applications, the 50% confidence interval will include the true value 50% of the time
    • A 95% confidence interval will include the true value 95% of the time

Tip

  • The usual 95% confidence interval rule of thumb for large samples (assuming a normal distribution) is to take the estimate \(\pm\) two standard errors
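The rule of thumb can be sketched directly. The sample below is hypothetical (100 draws from normal(5, 2)), chosen only to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical sample: 100 measurements from normal(mu=5, sigma=2)
sample = rng.normal(loc=5.0, scale=2.0, size=100)

estimate = sample.mean()
standard_error = sample.std(ddof=1) / np.sqrt(len(sample))

# Rule-of-thumb 95% CI: estimate +/- two standard errors
ci_low = estimate - 2 * standard_error
ci_high = estimate + 2 * standard_error
```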

Statistical significance and hypothesis testing

Statistical significance 1

  • Some scientists choose to consider a result to be “stable” or “real” if it is “statistically significant”
  • They may also consider “non-significant” results to be noisy or less reliable

Warning

I, and many other statisticians, do not recommend this approach.

However, the concept is widespread and we need to discuss it

Statistical significance 2

A common definition

  • Statistical significance is conventionally defined as a threshold (commonly, a \(p\)-value less than 0.05) relative to some null hypothesis or prespecified value that indicates no effect is present.

  • E.g., an estimate may be considered “statistically significant at \(P < 0.05\)” if it:

    • lies at least two standard errors from the mean
    • is a difference that lies at least two standard errors from zero
  • More generally, an estimate is “not statistically significant” if, e.g.

    • the observed value can reasonably be explained by chance variation
    • it is a difference that lies less than two standard errors from zero

Most tests rely on probability distributions

  • We need to relate the measured values in the real world to an appropriate distribution that approximates them

A simple example: The experiment

The experiment

  • Two drugs, \(C\) and \(T\), lower cholesterol, and we want to compare their effectiveness
  • We randomise assignment of \(C\) and \(T\) to members of a single cohort of comparable individuals, whose pre-treatment cholesterol level is assumed to be drawn from the same distribution (i.e. be approximately the same)
  • We measure the post-treatment cholesterol levels \(y_T\) and \(y_C\) for each individual in the two groups.
  • We calculate the average measured \(\bar{y}_T\) and \(\bar{y}_C\) for the treatment and control groups as estimates for the true post-treatment levels \(\theta_T\) and \(\theta_C\).
    • We also calculate standard deviation for the two groups, \(\sigma_T\) and \(\sigma_C\)

A simple example: The hypotheses

  • We want to know if the treatments have different sizes of effect
    • If they do, there should be a difference between the (average) post-treatment cholesterol level in each group
    • The true post-treatment levels are \(\theta_T\) and \(\theta_C\)
    • We have estimated means, \(\bar{y}_T\) and \(\bar{y}_C\) for post-treatment levels

The hypotheses

  • We are interested in \(\theta = \theta_T - \theta_C\), the expected post-test difference in cholesterol between the two groups \(T\) and \(C\).
  • Our null hypothesis (\(H_0\)) is that \(\theta = 0\), i.e. there is no difference (\(\theta_C = \theta_T\))
  • Our alternative hypothesis (\(H_1\)) is that there is a difference, so \(\theta \neq 0\), (i.e. \(\theta_C \neq \theta_T\))

A simple example: The distribution 1

  • To perform a statistical test, we may assume a distribution and parameters for the null hypothesis
    • We can then test the observed estimate against that distribution to see how likely it is that the null hypothesis would have generated it

The distribution

  • We use a probability distribution reflecting generation of the null hypothesis: \(\theta_C = \theta_T\)
    • This allows us to define a test statistic \(T\) and its critical value (the threshold for “significance”) in advance
  • We test the estimated value from the experiment (\(\bar{y}_T - \bar{y}_C\)) to calculate a \(p\)-value for our estimate: \(p = \textrm{Pr}(T(y^{\textrm{null}}) > T(\bar{y}_T - \bar{y}_C))\)

A simple example: The null hypothesis

The null hypothesis

  • Assume that the true difference \(\theta\) is normally distributed with \(\mu_\theta=0\), \(\sigma_\theta=1\)

A simple example: The estimated difference

Observed difference between post-treatment levels: \(\bar{y}_T - \bar{y}_C = -1.4\)

  • Is this an unlikely outcome given the null hypothesis?

A simple example: A significance threshold

We choose a significance threshold in advance

  • Suppose we set a threshold \(T\) corresponding to the 90% confidence interval (i.e. \(P<0.1\))
    • If the estimate is not in the central 90% of the distribution, we’ll say it’s “significant”

A simple example: Compare the estimate

Compare the estimate to the threshold

  • The estimate lies outwith the threshold, so we call the difference “significant”

A simple example: Another threshold

We choose a significance threshold in advance

  • Suppose we set the threshold \(T\) corresponding to the 95% confidence interval (i.e. \(P<0.05\)) instead?

A simple example: Another outcome

Compare the estimate to the threshold

  • The estimate lies within the threshold, so the difference is “not significant”

A simple example: What changed?

What did not change

  • The null hypothesis was the same
  • The observed estimate of difference was the same

What changed

  • Our choice of significance threshold changed

Significance threshold choice

  • Once the estimate is known, it is always possible to find a threshold that makes it “significant” or “not significant”
  • It is dishonest to select a threshold deliberately to make your result “significant” or “not significant”
  • Always choose and record (preregister) your threshold for significance ahead of the experiment

Tailed tests: two-tailed

Use two tails if direction of change doesn’t matter

  • With a two-tailed hypothesis test, we do not care which direction of change is significant

Tailed tests: one-tailed (left)

Use one-tailed tests when direction matters

  • If we’re testing specifically for a significant negative difference/reduction, use a left-tailed test
  • e.g. if we wanted to know if \(T\) reduced post-test levels with respect to \(C\) at a threshold of \(P < 0.05\)

Tailed tests: one-tailed (right)

Use one-tailed tests when direction matters

  • If we’re testing specifically for a positive difference/increase, use a right-tailed test
  • e.g. if we wanted to know if \(T\) increased post-test levels with respect to \(C\) at a threshold of \(P < 0.05\)

Problems with statistical significance 1

Warning

It is a common error to summarise comparisons by statistical significance into “significant” and “non-significant” results

Statistical significance is not the same as practical importance

  • Suppose a treatment increased earnings by £10 per year with a standard error of £2 (average salary £25,000).
    • This would be statistically, but not practically, significant
  • Suppose a different treatment increased earnings by £10,000 per year with a standard error of £10,000
    • This would not be statistically significant, but could be important in practice

Problems with statistical significance 2

Warning

It is a common error to summarise comparisons by statistical significance into “significant” and “non-significant” results

Non-significance is not the same as zero

  • Suppose an arterial stent treatment group outperforms the control
    • mean difference in treadmill time: 16.6s (standard error 9.8)
    • the 95% confidence interval for the effect includes zero, \(p \approx 0.20\)
  • It’s not clear whether the net treatment effect is positive or negative
    • but we can’t say that stents have no effect

Problems with statistical significance 3

The difference between ‘significant’ and ‘not significant’ is not statistically significant

  1. At a \(P<0.05\) threshold, only a small change in the data is required to move from \(P = 0.051\) to \(P = 0.049\)
  2. Large changes in significance can correspond to non-significant differences in the underlying variables


Standard errors, sample size, and statistical significance

Standard errors

Important

We cannot make an infinite number of measurements of \(z\). We can only take a sample.

The mean and standard deviation we estimate in an experiment will not match those of the infinitely large population.

Standard Error (of the Mean)

The standard error of the mean reflects the uncertainty in our estimate of the mean.

When estimating the mean of an infinite population, given a simple random sample of size \(n\), the standard error is:

\[ \textrm{standard error} = \sqrt{\frac{\textrm{Variance}}{n}} = \frac{\textrm{standard deviation}}{\sqrt{n}} = \frac{\sigma}{\sqrt{n}} \]

Standard error and sample size

Tip

Uncertainty in the mean estimate \(\mu\) reduces proportionally to the square root of the number of samples, \(n\)
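This square-root relationship is easy to see in code. A one-function sketch (`standard_error` is a name introduced here, not from the slides):

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

# Quadrupling the sample size halves the standard error:
# standard_error(1.0, 4) -> 0.5; standard_error(1.0, 16) -> 0.25
```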

Standard error and hypothesis testing 1

Hypothesis test statistics

  • Test statistic \(t\) is computed from the data and compared to a critical value on the distribution corresponding to the significance threshold

\[ t = \frac{Z}{s} = \frac{Z}{\sigma/\sqrt{n}} \]

  • \(Z\) is some function of the data (difference between estimate and true value); \(s\) is standard error of the mean

This is true for many hypothesis test methods

Standard error and hypothesis testing 2

One-sample \(t\)-test

\[ t = \frac{Z}{s} = \frac{\bar{X} - \mu}{\hat{\sigma}/{\sqrt{n}}} = \frac{\bar{X} - \mu}{s(\bar{X})} \]

  • \(\bar{X}\) is the sample mean; \(\mu\) is the hypothesised population mean (being tested)
  • \(\hat{\sigma}\) is the sample standard deviation; \(n\) is the sample size; \(s(\bar{X})\) is the standard error of the mean
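The one-sample \(t\) formula can be checked against SciPy (assuming SciPy is available; the sample values and hypothesised mean below are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

# Hypothetical sample of 10 measurements (true mean 5.5, sd 1.0)
sample = rng.normal(5.5, 1.0, size=10)

mu_0 = 5.0  # hypothesised population mean under the null

# t statistic from the formula above
se = sample.std(ddof=1) / np.sqrt(len(sample))
t_manual = (sample.mean() - mu_0) / se

# SciPy computes the same statistic, plus a (two-sided) p-value
result = stats.ttest_1samp(sample, mu_0)
```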

Wald test

\[ \sqrt{W} = \frac{Z}{s} = \frac{\hat{\theta} - \theta_0}{s(\hat{\theta})} \]

  • \(\hat{\theta}\) is the estimated maximising argument of the likelihood function; \(\theta_0\) is the hypothesised value under test
  • \(s(\hat{\theta})\) is the standard error of \(\hat{\theta}\)

Standard error and hypothesis testing 3

Hypothesis test statistics

  • Test statistic \(t\) is computed from the data and compared to a critical value on the distribution corresponding to the significance threshold

\[ t = \frac{Z}{s} = \frac{Z}{\sigma/\sqrt{n}} \]

  • \(Z\) is some function of the data (difference between estimate and true value); \(s\) is standard error of the mean

What happens if we hold \(Z\) and \(\sigma\) constant and vary sample size?

Sample size and hypothesis testing 1

We reject the null hypothesis when \(t > t_\textrm{crit}\)

  • Suppose we set \(t_\textrm{crit} = 2\) (and \(\sigma=1\), \(n=3\))
  • We can calculate \(t = \frac{Z}{s} = \frac{Z}{\sigma/\sqrt{n}}\) for any value of \(Z\)

Sample size and hypothesis testing 2

The difference (\(Z\)) we need to see to reject the null varies with sample size

  • Set \(t_\textrm{crit} = 2\)
  • \(n=3 \implies Z_\textrm{crit} \approx 1.15\); \(n=15 \implies Z_\textrm{crit} \approx 0.5\)
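Rearranging \(t = Z / (\sigma/\sqrt{n})\) gives the critical difference directly (`z_crit` is a helper name introduced here):

```python
import math

def z_crit(t_crit, sigma, n):
    """Smallest difference Z that reaches t = t_crit,
    rearranged from t = Z / (sigma / sqrt(n))."""
    return t_crit * sigma / math.sqrt(n)

# With t_crit = 2 and sigma = 1, as on the slide:
# z_crit(2, 1, 3)  -> ~1.15
# z_crit(2, 1, 15) -> ~0.52
```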

Statistical significance and effect size

Statistical significance is not the point

  • We can meet any statistical significance threshold for a difference by sufficiently increasing the sample size

Statistical significance is not the same as practical importance

  • Suppose a treatment increased earnings by £10 per year with a standard error of £2 (average salary £25,000).
    • This would be statistically, but not practically, significant

What matters is effect size

In all our experiments we should be concerned not with “statistical significance,” but with how likely it is that, if there is a meaningful effect of the treatment, our experiment will be able to detect it.

  • We need to be concerned with statistical power

Statistical power

Statistical power

Important

Statistical power is defined as the probability, before a study is performed, that a particular comparison will achieve “statistical significance” at some predetermined level (e.g. \(P < 0.05\)) given an assumed true effect size.

The process

  1. Hypothesise an appropriate effect size (e.g. what effect will improve health?)
  2. Determine the \(p\)-value threshold you consider “statistically significant”
  3. Make reasoned assumptions about the variation in the data (e.g. what distribution? what variance?)
  4. Choose a sample size
  5. Use probability calculations to determine the chance that your observed \(p\)-value will be below the threshold (rejecting the null hypothesis) for the hypothesised effect size
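Step 5 can be done by simulation. A Monte Carlo sketch for a two-sample \(t\)-test (assuming SciPy is available; the effect size, sd, group size, and simulation count are illustrative assumptions, not values from the slides):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2024)

def estimated_power(effect, sigma, n_per_group, alpha=0.05, n_sims=2000):
    """Monte Carlo estimate of power for a two-sample t-test:
    the fraction of simulated experiments with p < alpha."""
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, sigma, n_per_group)
        treated = rng.normal(effect, sigma, n_per_group)
        if ttest_ind(control, treated).pvalue < alpha:
            hits += 1
    return hits / n_sims

# An effect of one standard deviation with 17 animals per group
# gives roughly 80% power at alpha = 0.05
power = estimated_power(effect=1.0, sigma=1.0, n_per_group=17)
```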

Effect sizes

Power analysis depends on an assumed effect size

  • The true effect size is almost never known ahead of time
    • Determining the effect size is usually why we’re doing the study

How to choose effect sizes

  • Try a range of values consistent with relevant literature
  • Determine what value would be of practical interest (e.g. improvement in outcomes of 10%)

How not to choose effect size

  • DO NOT USE AN ESTIMATE FROM A SINGLE NOISY STUDY!
    • Noisy studies suffer from The Winner’s Curse

The Winner’s Curse 1

A low-powered pilot study

  • Suppose we ran a small pilot study with only a few individuals
  • The study, by design, has low statistical power
    • The variance of the data is relatively large, compared to the true effect size

The Winner’s Curse 2

You get a statistically significant result!

  • You think you won, but you lost! (The Winner’s Curse)
    • The estimate is either eight times too large (at least 16% instead of 2%) or
    • The estimate has the wrong sign (a negative change instead of positive)

The Winner’s Curse 3

The trap

Any apparent success of low-powered studies masks larger failure

When signal (effect size) is low and noise (standard error) is high, “statistically significant” results are likely to be wrong.

Low-power studies tend not to replicate well

Warning

Low-power studies have essentially no chance of providing useful information

We can say this even before data are collected

Published results tend to be overestimates

Statistical power and ethics

It is unethical to under-power animal studies

Under-powered in vivo experiments waste time and resources, lead to unnecessary animal suffering and result in erroneous biological conclusions (NC3Rs Experimental Design Assistant guide)

It is unethical to over-power animal studies

Ethically, when working with animals we need to conduct a harm–benefit analysis to ensure the animal use is justified for the scientific gain. Experiments should be robust, not use more or fewer animals than necessary, and truly add to the knowledge base of science (Karp and Fry (2021))

So how should we appropriately power animal studies?

Statistical power and error 1

We often refer to two kinds of statistical error

Type I Error (\(\alpha\))

  • Type I error is the probability of rejecting a null hypothesis, when the null hypothesis is true
    • Also known as a “false positive error”
  • Represented by the Greek letter \(\alpha\)

Type II Error (\(\beta\))

  • Type II error is the probability of accepting a null hypothesis, when the null hypothesis is false
    • Also known as a “false negative error”
  • Represented by the Greek letter \(\beta\)

Statistical power is \(1 - \beta\)

Statistical power and error 2

Statistical power needs context: the expected error rates of the experiment at a given effect size, e.g.

The experiment has 80% power at \(\alpha = 0.05\) for an effect size of 2 mmol/L

How to read this

  • “an effect size of 2 mmol/L”: we are aiming to detect an effect of at least 2 mmol/L (e.g. blood glucose concentration)
  • “\(\alpha = 0.05\)”: we are using a significance test threshold (\(\alpha\), type I error rate) of \(P < 0.05\)
  • “80% power”: we expect the study to report a significant effect, where one truly exists, 80% of the time

Statistical power and error 3

The experiment has 80% power at \(\alpha = 0.05\) for an effect size of 2 mmol/L

If the drug truly has no effect

  • The test has \(\alpha = 0.05\), so we would expect to reject the null hypothesis incorrectly 5% of the time
  • If we ran the experiment 100 times, we would expect to see a result implying that the drug was effective five times

If the drug truly has an effect

  • The test has predicted power \(1 - \beta = 0.8\), so the type II error rate \(\beta = 0.2\) and we would expect to accept the null hypothesis incorrectly 20% of the time
  • If we ran the experiment 100 times, we would expect to see a result implying that the drug was effective eighty times

Statistical power and sample size 1

What we need, to calculate appropriate sample size

  • An acceptable false positive rate (type I error, \(\alpha\))
  • An acceptable false negative rate (type II error, \(\beta\))
    • This is equivalent to knowing the target statistical power (\(1 - \beta\))
  • The expected effect size and variance
  • The statistical test being performed

Important

We need this information to calculate an appropriate, ethical sample size

Statistical power and sample size 2

Typical funders’ requirements

  • False positive rate \(\alpha = 0.05\)

  • Power \(1 - \beta = 0.8\) (80% power)

  • These are only a starting point - other values may be more appropriate depending on circumstance

Under experimenter control

  • Effect size and variance
  • The appropriate statistical approach
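Putting these ingredients together, a common normal-approximation formula for a two-group comparison of means is \(n = 2\sigma^2(z_{1-\alpha/2} + z_{1-\beta})^2/\delta^2\) per group. A stdlib-only sketch (real studies should use dedicated power software, e.g. the NC3Rs Experimental Design Assistant; the function names here are introduced for illustration):

```python
from math import ceil, erf, sqrt

def z_quantile(p):
    """Standard normal quantile, found by bisection (stdlib only)."""
    phi = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def n_per_group(effect, sigma, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group for comparing two
    group means: n = 2 * sigma^2 * (z_{1-a/2} + z_{power})^2 / effect^2."""
    z_a = z_quantile(1.0 - alpha / 2.0)  # ~1.96 for alpha = 0.05
    z_b = z_quantile(power)              # ~0.84 for 80% power
    return ceil(2.0 * sigma ** 2 * (z_a + z_b) ** 2 / effect ** 2)

# Detecting a one-sd effect at alpha = 0.05 with 80% power
# needs about 16 animals per group
```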

References


Bate, Simon T., and Robin A. Clark. 2014. The Design and Statistical Analysis of Animal Experiments. Cambridge University Press.
Karp, Natasha A, and Derek Fry. 2021. “What Is the Optimum Design for My Animal Experiment?” BMJ Open Sci. 5 (1): e100126.
Kilkenny, Carol, William J Browne, Innes C Cuthill, Michael Emerson, and Douglas G Altman. 2010. “Improving Bioscience Research Reporting: The ARRIVE Guidelines for Reporting Animal Research.” PLoS Biol. 8 (6): e1000412.
Kilkenny, Carol, Nick Parsons, Ed Kadyszewski, Michael F W Festing, Innes C Cuthill, Derek Fry, Jane Hutton, and Douglas G Altman. 2009. “Survey of the Quality of Experimental Design, Statistical Analysis and Reporting of Research Using Animals.” PLoS One 4 (11): e7824.